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Abstract 

We show how to sample acyclic digraphs uniformly at random through recursive enumeration. This provides an exact 
method which avoids the convergence issues of the alternative Markov chain methods. The limiting behaviour of the 
distribution of acyclic digraphs also allows us to sample arbitrarily large acyclic digraphs. Finally we discuss how 
to include various restrictions in the combinatorial enumeration for efficient uniform sampling of the corresponding 
graphs. 
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1. Introduction 



Directed acyclic graphs (DAGs) are the basic repres entation of the structure underlying Bayesian networks, which 
in turn represent multivariate probability distributions dLauritzen , 119961: iNeapolitanl l2004h . They are largely used in 
many fields o f applied sta t istics with especially important applications in biostatistics, such as the learning of epistatic 
relationships (IJiang et al.L 1201 ih . The estimation of DAGs or their equivalence class is a hard problem a nd methods 



for th eir efficient reconstruction from data is a very active field of research: a recent review is given by iDalv et al 



20111 while some new methodological developments for estimating high dimensional sparse DAGs are discussed by 
Kalisch and Biihlmannl l2007t IColombo et al.l 1201 2i For simulation studies aimed at assessing the performance of 
learning algorithms which reconstruct a graph from data, it is crucial to be able to generate uniform samples from 
the space of DAGs so that any structure related bias is removed. The only currently available method relies on the 
construction of a Markov chain whose properties ensure that the Umiting distribution is u niform over all DAGs with 
a given number of vertices n. The strategy is based on a well known idea first suggested by Madi^an and YorkI ( 1995 ) 
as a Markov Chain Monte Carlo (MCMC) scheme in the context of Bayesian graphical models to sample from the 
posterior distrib ution of graphs conditio nal on the data. A specific algorithm for uniform sampling of DAGs was 
first provided by lMelan9on et al.l (120011) . with the advantage over the standard MCMC scheme of not requiring the 
evaluation of the sampled graphs' neighbourhood s i ze, at t he expense of a slower convergence. The method was 
later extended by Ide and CozmanI (2002); Ide et al. (2004); Melan9on and Philippe! (l2004i) to limit the sampling to 
restricted sets of DAGs. An R implementation was also recently provided bv IScutar il (l2010l) . Since Markov chain 
based algorithms pose non-negligible convergence and computational issues, in practice random upper or lower tri- 
angular adjacency matrices are often s ampled to generate ra ndom ensembles for simulation studies [as for example 
implemented in the pcalg R package of iKaUsch et al.l (l2012h l. This method however does not provide uniformly dis- 
tributed graphs on the space of DAGs and could for example perform poorly to obtain starting points for hill-cUmbing 
algorithms or slowly converging Markov chains by increasing the risk of remaining within a small neighbourhood of 
certain graphs and more inefficiently exploring the space. Likewise uniform sampling allows the correct evaluation 
of reconstructing algorithms. Finally, when evaluating the prevalence of certain features in a population, a uniform 
sample is essential. Here we therefore present a sampling strategy based on the recursive enumeration of DAGs but 
where no explicit listing is required. 
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2. Acyclic digraphs 

A directed graph or digraph on « vertices has up to one directed edge between each pair of vertices, from node 
i to node j with 1 < j < n. Acyclic digraphs or DAGs are those which admit no cycles so that there are no paths 
along the di rected edges fr om a node to itself. The total number a„ of labe lled DAGs with n nodes, which is sequ ence 
A003024 in lSloand (120121) . and its asymptotic behaviour are respectively jRobinson , Il97dll973l:lstanlevlll973h 



*=1 ^ ' 
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with L - (^2) = n(n - l)/2 an d where M - 0.574 . . . and q - 1.48 .. . are constants. The number of unlabelled DAGs 
was found later bv Robinsonl l 1977 ) and is sequence A003087 in Sloane (2012). 

The directed edges of a DAG can be represented as entries with value 1 of a vertex adjacency matrix with remaining 
en tries 0. Being acycUc the matrix can be made upper triangular by relabelling the nodes, which lies behind the proof 
of iMcKav et al.l ( 120041) that the number of « x « (0,l)-matrices with only positive real eigenvalues is also a„. Hence 
DAGs can be easily generated by sampling each of the L elements of an upper triangular matrix from a Bernoulli 
with probabihty p = 1/2 and then permuting the n labels, but they are non-uniform since several permutations may 
correspond to the same graph. For example, the DAG with no arcs (the matrix) would be counted «! times, while 
those with the most arcs (when all L upper triangular elements are 1) only once. Choosing p 1/2 may reduce the 
overall non-uniformity but because the overcounting is sensitive to the structure of the DAG, not remove it entirely. 

2.1. Markov chain method 



The algorithm of lMelan9on et al.l (1200 Ih starts with a given DAG, with no arcs say, and proceeds by repeatedly 
sampling an ordered pair of vertices from the « available. If an arc joins the pair in that direction it is deleted, otherwise 
it is added as long as the graph remains acyclic. Addition and deletion of an arc are inverse operations so that the 
resulting transition matrix T is symmetric. As adding an arc to any pair (/, /) would immediately create a cycle the 
probability to stay with the same graph is nonzero and at least 1 /«. One could exclude all but one such pair from 
the sampling to reduce this probability and marginally improve the convergence of the chain. Finally a path exists 
between any pair of DAGs since deleting all the arcs in turn leads to the empty DAG and the steps are reversible. 
These three properties ensure that the stationary distribution of the Markov chain is uniform over the space of graphs. 

The Markov chain method is very elegant, as no particular knowledge of DAGs is required apart from checking 
whether the graphs remain acyclic, which can be done in a ti me of the order between « and the number of edges 
(typically around n^) and on average in a time of order « log(n) (lAlon and RodehLll978llMelan9on et al.Ll200lh . The 
issue is how quickly the chain converges. The longest path is between a DAG with the most arcs (L) and the same 
graph with all the edges reversed. Therefore after 2L steps we obtain an irreducible transition matrix U = T^^, whose 
elements though are far from uniform. First removing all the arcs in L steps has a transition probability of Ll/rr^, and 
the same adding them back due to the symmetry. The L additions and deletions can be ordered in ("/li") ways, though 
the probability of different orderings is lower. This leads to a minimum transition probability < {2L)\/n'^ which is 
actually smaller than 1 /a^, much less than the required uniform value of \/a„. 

The difference between the maximum and minimum of the elements of [/* decays at least exponentially ~ 
exp(-a'A:) with a rate a depending on the largest eigenvalue of U below that at 1. The probability for any DAG 
with the most arcs to remain unchanged is 1 /2 H- 1 /2n so the maximum element of {/'''^ is at least 1 /2^. To approach 
the uniform distribution it should be reduced by a factor of around nl/q", in line with ([T]i, suggesting that k is at least 
of order n log(n)/Q'. Since the largest element of U is certainly < 1 the order of k is at most n^/a. Now what matters 
for the speed of convergence is how a depends on «. Proving lower bounds is extremely hard and a generally known 
one, given by twice the minimum entry of U, is not enough to ensure that the algorithm is useful in practice since it 
would require k much larger than a„ for convergence. For DAGs with 2 or 3 nodes T can be easily filled out and we 
find a = 0.575 and a - 0.594 respectively, and with a reduced probability of not moving, a = 0.8 1 1 and a - 0.776 in- 
stead. Its behaviour should be verified, but as long as a does not decrease asymptotically with n, a uniformly sampled 
DAG can be obtained by throwing away the first 0{n'^) steps of the Markov chain. Analogously 0(«*) steps should 
also be discarded between further samples to ensure they are independent. These conditions should be observed when 
using implementations like that injScutari, (.2010i) . with the entire algorithm then becoming 0{n^ log(n)) and limiting 
the size of graphs that can be sampled. 
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3. Enumeration method 



In order to obtain a uniform sample directly, we return to the enumeration of labelled DAGs detailed in lRobinson 
1977h . Nodes with no incoming arcs are called 'outpoints' and can be used to further classify and recursively 
enumerate DAGs, which have at least one since they are acyclic. Let a„ ^ be the number of labelled DAGs with n 
nodes of which k are outpoints (1 < k < n). Removing the latter and the arcs originating from them leaves a smaller 
DAG with m - n - k nodes, with s outpoints say. Reversing the process, the required DAGs can be built by adding k 
new outpoints and allowing all the possible connections from them to the previous nodes. For each of the m - s non- 
outpoint nodes there are 2'' possibilities of having an arc or not, while each of the s old outpoints must be connected 
to at least once, giving a factor of 2* - 1 . Finally the labels can be rearranged in ways, giving the recursions 
( lRobinsonl[l970[| 19771) 



iK,, ^., = £(2*-l)'2^('"-). 

^ S-l 



(2) 



with fl„_„ - 1 the DAGs with no arcs and b„^k auxiliary quantities useful later The total number of DAGs with n nodes 
is a„ = Tj"k=\ ^n,k and can be turned into the more compact form in ([T]i by means of the inclusion-exclusion principle. 
The separation in terms of outpoints however is the key for our strategy to sample labelled DAGs uniformly. 



3.1. Recursively generating the outpoints 

The first step in our algorithm is to compute all the integers a,, ^ (and b„^k) as well as the totals «„. From ([TJ it 
follows that the number of bits needed to store them grows like L, while 2n + 1 integers are needed for each « (since 
1 < ^ < «). Finally each of them is obtained by adding an order of n smaller integers (multiplied by the binomial 
function) so that the process becomes 0{n'^). However, they only need to be calculated once and then stored for 
sampling arbitrarily many DAGs. Later we also discuss how to avoid their explicit evaluation. 

The second step is, given n, to sample the number of outpoints. An integer is drawn between 1 and a,,, for example 
by sampling from a Bernoulli(i) the digits of a binary string long enough to represent a„ (if the resulting number is 
larger than a„ the string is rejected and resampled, which happens with probability less than a half). The chosen 
number of outpoints is the minimum k for which the partial sum a,, , is larger than the sampled integer 

The final DAG will be obtained by connecting the k outpoints to a smaller one of size m, whose number of 
outpoints can be taken to be the minimum s for which the partial sum 2iLi(2* - l)'2'"''""'*fl,„,, is larger than a randomly 
drawn integer between 1 and bu t. Similarly a DAG with m nodes and s outpoints can be sampled by drawing an 
integer between 1 and bi„ s. The number of outpoints kj removed at each step can so be recursively sampled until their 
Yj'i ki - n. In total we need to sample and manipulate I < n integers whose size in bits is at most of order L, so that 
generating the sequence is 0(n^). Though we proposed sampling and integer at each step, the original one between 1 
and a„ could be equally reused to obtain the entire sequence of outpoints. 

3.2. Reconstructing the DAG 

To represent the sampled DAG we fill a null « X n adjacency matrix from the top left to obtain a lower triangular 
matrix. Since the recursion ends when all the remaining nodes are outpoints, the first ki x ki elements stay zero. These 
must connect to at least one of newly added outpoints which must not connect to each other, hence the elements 
of each column from 1 to ki in rows k; + \ lo k; + kj^i can be sampled from a Bernoulli(^). Any samples where all 
the elements in a column are can be rejected columnwise (alternatively a direct binomial sampling with this case 
excluded can be performed). 

When adding an additional ^/_2 nodes, the elements in columns 1 to k; and rows fe/ + + 1 to kj + kj^i + ^/_2 
are again sampled from a Bernoulli(:|), while for each column kj + I to kj + kj^i we need to ensure that not all new 
entries are 0. Completing the adjacency matrix is a step of order n^ since up to L integers need to be sampled. The 
final sampled DAG can be obtained by permuting the labels only at the end, rather than at each step as suggested by 
(|2]i, since each permutation must be equally Hkely. Although different permutations may still correspond to the same 
DAG, drawing the number of outpoints at each step is weighted so that the resulting sample is perfectly uniform. 
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Table 1: The relative occurrence of the number of outpoints in large DAGs, At multiplied by lO'". Ag is approximately 2.2 X 10 so k > 7 can 
be excluded at this level of accuracy. 



A2 



5 743 623 733 
3 662 136 732 



A3 
A4 



564 645 435 
29 023 072 



As 
Afi 



566 517 
4 496 



Av 
As 



15 




3.3. Arbitrarily large DAGs 

The number of outpoints sampled at each step is small and most often just 1 . In fact the fraction of DAGs with 
k outpoints among n, A^ - lim„^co a„ kla„, converges to within 10"'" by n - 20. For n > 20, within the uniformity 
limits of 32 bit integer sampler, k could be directly sampled from the limiting distribution in Table[T] Given the initial 
k the next value s in the sequence of outpoints can be drawn from the limiting conditional probabilities 

"■" = (' ^' = ?('-?)"^- 

which follow from (|2]l. The A:-dependent weight further shifts the distribution towards low s so it would be computa- 
tionally meaningful to store the small number of relevant probabilities. The procedure is iterated until we are left with 
20 nodes or fewer, when the tabulated distributions are no longer accurate enough. For n < 20 one could tabulate the 
exact distributions or simply return to the enumeration method. Generating the sequence of outpoints is now of order 
n so that a uniform (at scales above 10 DAG can be sampled in 0{n^) allowing one to move to very large graphs. 

The accuracy of the convergence to the limiting distributions of A^ nearly exactly doubles when doubling n. To 
obtain uniformity on the scale of a 64 bit integer sampler we can use (a more accurately calculated distribution of) Ak 
for n > 40 and return to the exact enumeration for n < 40. 



4. Restrictions and variations 



A promising application of the Markov chain method, as particularly pursued in Ide and Cozmanl ( 2002 ): Ide et al 



(120041) . is to impose restrictions on DAGs sampled. For example the number of edges and the degrees of the nodes can 
be restricted by simply rejecting any steps producing graphs which do not meet the requirements. More complicated 



Mark ov chains are needed for sampling poly trees and DAGs with restricted density (llde and Cozmanll2002tllde et al 



20041) but the speed of convergence to uniformity remains an issue. Although the space of restricted graphs is smaller, 
the probability of not moving for graphs on the border of the restricted set increases due to the additional rejections. 
If no recursive enumeration is possible for the restricted graphs, however, the Markov chain approach remains the 
only one available, but for restrictions which can be incorporated combinatorially, direct enumeration again provides 
a more efficient method and we discuss some examples below. 

4.1. Connected DAGs 



The main restriction imposed in llde and CozmanI (120021) : lMelan9on and Philippe! (120041) is that DAGs be weakly 
connected (admitting a path between every pair of nodes in the the und erlying undirected graph). They were also 
counted by Robinsonl ( 1973 ) and are sequence A082402 in Sloane ( 2012). The M arkov chain method can be simply 
modified so to only delete an arc if the r esulting graph is also connecte d (llde and C ozman. 20()2) or, more efficiently, 
instead to reverse its direction otherwise ( lMelan9on and PhiUppell2004l) . Checking for connectedness is of order of the 
number of arcs making the algorithm O(n^) and certainly slower since at each step either connectedness o r acyclicity 
of the proposed graph needs to be checked. The fracti on of connected DAGs tends to 1 as n increases dRobinson , 



1973t iBender et al.L Il986t iBender and Robinsoni Il988h : for n = 4 it is 82% (for smaller n we can enumerate the 
possibilities by hand), while it is > 99% and growing for n > 8. Therefore the above modifications of the Markov 
chain, though theoretically interesting, are of no practical use, as it is much more efficient to sample a DAG at random 
and reject the disconnected ones. 

When removing the k outpoints of a connected DAG the rest breaks up into a set of smaller connected DAGs, hence 
a recursive combinatorial enumeration (involving the sum over the partitions of m) is possible. Again the increase in 
algorithmic complexity makes a simple rejection step more efficient, apart from possibly for small n. 
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4.2. Restricted new connections 

Because new outpoints are added at each stage of the recursive enumeration of DAGs, restrictions on the number 
of incoming arcs the previous nodes can receive from them are the simplest to treat. Allowing up to a maximum K 
new connections {K >\ avoids only generating DAGs with no arcs) the recursions change to 



(\ m min(A', A") / , \ rmrivA:, a ; / , 
n\ I (')i 1 § e 



mm(k,K) 



(4) 



For K > k the above reduces to (|2]) while a restriction on the minimum number of new connections can be incorporated 
analogously. Such restricted DAGs can easily be counted and then sampled as in Section|3] In reconstructing the DAGs 
new arcs can be uniformly sampled from the binomial distributions in (|4]i. Retaining the limit on the arcs received 
by the previous outpoints, it is also straightforward to separately limit the connections each new one can send to the 
previous non-outpoints to a maximum of K^. The corresponding DAGs can be sampled through 



s-l I (=1 



min(k,K) 



mm{m-s,Ka) 



/=() 



m - s 
i 



(5) 



4.3. Restricted number of children 

Because they must be connected to, including previous outpoints when limiting the outgoing arcs from each new 
one is more complicated. Since arcs only originate from newly added outpoints, this means limiting the children to a 
certain K. An expression for the ways C{k, m, s, K) of linking k new outpoints may be found by subtracting from the 
ways of connecting k nodes to m (with up to K arcs each) those which leave any of the s outpoints unconnected to 



C(k, m, s, K) — 



^ I . I -^[\C{k,m - i, s - i, K), a„,k = L I 

/=0 ^ ' J /= 1 ^ ' ^ ' s=l 



(6) 



The graph is reconstructed by drawing from the C{k, m, s, K) possibilities at each step, and formula (|6]l simply suggests 
rejecting the cases which leave some of the s outpoints with no incoming link. It is however more efficient to first 
obtain the distribution of the number of arcs linking each new outpoint to the old ones when excluding the cases with 
a total < s. Samples from it where any old outpoints do not receive arcs are then rejected. Arcs to the remaining nodes 
can be drawn subsequently. For C{k, m, s, K) much smaller than the first term on the right of or when the average 
number of new arcs linking to the old outpoints falls, the acceptance ratio also drops. In this limit it may be convenient 
to first draw exactly one link to each of them and then sample the remaining arcs among all the nodes, including the 
s old outpoints. The configurations where each receives /, incoming arcs are overcounted by a factor F - HiLi h, 
so the sample should only be accepted with probability IjF. A balance of the two approaches may help minimise 
the rejection probability, but this example highlights that just knowing the number of possibilities is not sufficient for 
reconstruction, and a constructive method to generate them is preferable. A uniformly drawn DAG with a limit on the 
number of parents can be simply obtained by inverting all the arcs of a DAG with a limited number of children. 



4.4. Weighted sample 

When generating DAGs by sampling upper triangular matrices it is easy to influence their sparsity by sampling 
the entries from a Bernoulli with p + 1/2, this weighting however acts on an underlying non-uniform space of DAGs. 
In practice though it would be desirable to obtain a weighted sample from an underlying uniform distribution, which 
is achieved by making the probability of a DAG with / arcs proportional to p\\ - p)^^' and including analogous 
factors in the recursive enumeration. Each of the m - s non-outpoints may receive no links from the k new ones with 
weight (1 - p)'', one link with weight kp{l - p)*"' and so on to include all the terms in the binomial expansion of 
[(1 - p) + = 1. The possibility that no arcs link to the s old outpoints is excluded, while no arcs are allowed 
between the k new ones out of a theoretical ways of joining them, leading to the recursion 

"\l-pPJ]{^-(l-pfya,„,s (7) 
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where now a„ „ = (1 - p)^ correspond to the DAGs with no arcs out of L possibilities. We take 0*^ = 1 so that a\^\ - 1 
even for p - 1 where only one new outpoint is added at each step. Weighted DAGs can be generated as in Section[3] 
where links are now added at each step with probability p. The terms in the recursions are no longer integers in general 
(though for rational p the numerator and denominator could be stored as a pair of integers) and exact sampling can be 
obtained up to the precision limits of the computational implementation. 



5. Conclusions 

The recursive method for generating DAGs we presented has the advantage over the highly adopted strategy 
of sampling triangular adjacency matrices of producing a uniformly distributed sample and over Markov chain based 
algorithms of avoiding convergence problems and being computationally more efficient. Moreover the ideas discussed 
here can also be combined with an MCMC scheme in the inference of Bayesian graphical models. A procedure can be 
built following the combinatorial construction where moves are suggested by going back a certain number of steps in 
the sequence of adding outpoints and resampling just the new connections at that step. As they are sampled uniformly, 
such moves are reversible. To ensure full space reachability, one could introduce some probab ility of proposing a 



move uniformly amongst all DAGs or according to the standard MCMC on n etwork structures jMadigan and York , 
1995 ). The idea is similar in spirit to the sampling in the space of orders of Friedman and KoUeJ ( 20031) and has 



therefore the potential of greatly improve the mixing properties of the chain produced, with the advantage of possibly 
proposing structures which are in some sense close in the space of graphs. Since nei ther the original MCMC on 



network structures nor the order version appear to be completely satisfactory in practice dGrzegorczvk and Husmeier 



20081) the possibility outlined above seems promising. 
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